ABSTRACT


To exploit the similarity information hidden in the hyper- 
link structure of the web, this paper introduces algorithms 
scalable to graphs with billions of vertices on a distributed 
architecture. The similarity of multi-step neighborhoods of
vertices is numerically evaluated by similarity functions
including SimRank [20], a recursive refinement of cocitation;
PSimRank, a novel variant with better theoretical charac- 
teristics; and the Jaccard coefficient, extended to multi-step 
neighborhoods. Our methods are presented in a general 
framework of Monte Carlo similarity search algorithms that 
precompute an index database of random fingerprints, and 
at query time, similarities are estimated from the finger- 
prints. The performance and quality of the methods were 
tested on the Stanford Webbase [19] graph of 80M pages by 
comparing our scores to similarities extracted from the ODP 
directory [26]. Our experimental results suggest that the
hyperlink structure of vertices within four to five steps provides
more adequate information for similarity search than
single-step neighborhoods.


Categories and Subject Descriptors 


H.3.3 [Information Storage and Retrieval]: Informa- 
tion Search and Retrieval; G.2.2 [Discrete Mathematics]: 
Graph Theory—Graph algorithms; G.3 [Mathematics of 
Computing]: Probability and Statistics—Probabilistic al- 
gorithms 


General Terms 
Algorithms, Theory, Experimentation 


Keywords 


similarity search, link-analysis, scalability, fingerprint 


1. INTRODUCTION 


The development of similarity search algorithms between 
web pages is motivated by the “related pages” queries of web 
search engines and web document classification. Both
applications require efficient evaluation of an underlying
similarity function, which extracts similarities from either the
textual content of pages or the hyperlink structure. This paper
focuses on computing similarities solely from the hyperlink 
structure modeled by the web graph, with vertices corre- 
sponding to web pages and directed arcs to the hyperlinks 
between pages. In contrast to textual content, link structure 
is a more homogeneous and language independent source of 
information that is in general more resistant against
spamming. The authors believe that complex link-based
similarity functions with scalable implementations can play as
important a role in similarity search as PageRank [27] does for
query result ranking.

Several link-based similarity functions have been suggested 
over the web graph. To exploit the information in multi- 
step neighborhoods, SimRank [20] and the Companion [11] 
algorithms were introduced by adapting link-based ranking 
schemes [27, 21]. Further methods arise from graph theory 
such as similarity search based on network flows [23]. We 
refer to [22], which contains an exhaustive list of link-based 
similarity search methods. 

Unfortunately, no scalable algorithm has so far been
published that allows the computation of the above similarity
scores for a graph with billions of vertices. First,
all the above algorithms require random access to the web 
graph, which does not fit into main memory with standard 
graph representations. In addition, SimRank iterations up- 
date and store a quadratic number of variables: [20] reports 
experiments on graphs with fewer than 300K vertices. Finally,
related page queries require off-line precomputation, since a 
document cannot be compared to all the others one-by-one 
at query time. It is not clear what we could precompute for 
an algorithm like the one in [23] with no information about 
the queried page. 

In this paper we give scalable algorithms that can be used 
to evaluate multi-step link-based similarity functions over 
billions of pages on a distributed architecture. With a single 
machine, we conducted experiments on a test graph of 80M 
pages. Our primary focus is SimRank, which recursively re- 
fines the cocitation measure analogously to how PageRank 
refines in-degree ranking [27]. We give an improved Sim- 
Rank variant; in addition, we also handle a similarity func- 
tion that naturally extends the Jaccard coefficient from one- 
step to multi-step neighborhoods. Notice that scalability 
here is non-trivial, since the Jaccard coefficient may involve 
extremely large sets: the multi-step neighborhood of a ver- 
tex usually contains a large portion of the pages [4]. 


All our methods are Monte Carlo: we precompute inde- 
pendent sets of fingerprints for the vertices, such that the 
similarities can be approximated from the fingerprints at 
query time. We only approximate the exact values; for- 
tunately, the precision of approximation can be easily in- 
creased on a distributed architecture by precomputing inde- 
pendent sets of fingerprints and querying them in parallel. 

We started to investigate the scalability of SimRank in [12], 
and we gave a Monte Carlo algorithm with the naive rep- 
resentation as outlined in the beginning of Section 2. The 
main contributions of this paper are summarized as follows: 


• In Section 2.1 we present a scalable algorithm to compute
approximate SimRank scores by using a database of
fingerprint trees, a compact and efficient representation of
precomputed random walks.


• In Section 2.2 we introduce and analyze PSimRank, a
novel variant of SimRank with better theoretical
properties and a scalable algorithm.


• In Section 2.3 the Jaccard coefficient is naturally extended to
multi-step neighborhoods with a scalable algorithm.


• In Section 3 we show that all the proposed Monte Carlo
similarity search algorithms are especially suitable for
distributed computing.


• In Section 4 we prove that our Monte Carlo similarity
search algorithms approximate the similarity scores with
a precision that tends to one exponentially with the
number of fingerprints.


• In Section 5 we report experiments about the quality and
performance of the proposed methods evaluated on the
Stanford WebBase graph of 80M vertices [19].


Due to space constraints, most of the proofs are only pre- 
sented in the full version [13] of this paper. 

In the remainder of the introduction we discuss related 
results, define “scalability,” and recall some basic facts about 
SimRank. 


1.1 Related Results 


Unfortunately the algorithmic details of “related pages” 
queries in commercial web search engines are not publicly 
available. We believe that an accurate similarity search al- 
gorithm should exploit both the hyperlink structure and 
the textual content. For example, the pure link-based algo- 
rithms like SimRank can be integrated with classical text- 
based information retrieval tools [1] by simply combining the 
similarity scores. Alternatively, the similarities can be ex- 
tracted from the anchor texts referring to pages as proposed 
by [8, 16]. 

Recent years have witnessed a growing interest in the scal- 
ability issue of link-analysis algorithms. Palmer et al. [28] 
formulated essentially the same scalability requirements that 
we will present in Section 1.2; they give a scalable algorithm 
to estimate the neighborhood functions of vertices. Analo- 
gous goals were achieved by the development of PageRank: 
Brin and Page [27] introduced the PageRank algorithm using
main memory of size proportional to the number of vertices. 
Then external memory extensions were published in [9, 15]. 
A large amount of research was done to attain scalability 
for personalized PageRank [17, 14]. The scalability of
SimRank was also addressed by pruning [20], but this technique
could only scale up to a graph with 300K vertices in the ex- 
periments of [20]. In addition, no theoretical argument was 
published about the error of approximating SimRank scores 
by pruning. In contrast, the algorithms of Section 2 were 
used to compute SimRank scores on a test graph of 80M 
vertices, and the theorems of Section 4 give bounds on the 
error of the approximation. 

The key idea of achieving scalability by Monte Carlo al- 
gorithms was inspired by the seminal papers of Broder et 
al. [5, 7] and Cohen [10] estimating the resemblance of text 
documents and size of transitive closure of graphs, respec- 
tively. Both papers utilize min-hashing, the fingerprinting 
technique for the Jaccard coefficient that was also applied 
in [16] to scale similarity search based on anchor text. The 
main contribution of Section 2.3 is that we are able to gener- 
ate fingerprints for multi-step neighborhoods with external 
memory algorithms. Monte Carlo algorithms with simulated 
random walks also play an important role in a different as- 
pect of web algorithms, when a crawler attempts to down- 
load a uniform sample of web pages and compute various 
statistics [18, 29, 2] or page decay [3]. We refer to the book 
of Motwani and Raghavan [25] for more theoretical results 
about Monte Carlo algorithms solving combinatorial prob- 
lems. 


1.2 Scalability Requirements 


In our framework similarity search algorithms serve two 
types of queries: the output of a sim(u,v) similarity query 
is the similarity score of the given pages u and v; the output 
of a $\mathrm{related}_\alpha(u)$ related query is the set of pages for which
the similarity score with the queried page u is larger than
the threshold $\alpha$. To serve queries efficiently we allow off-line
precomputation, so the scalability requirements are formu- 
lated in the indexing-query model: we precompute an index 
database for a given web graph off-line, and later respond to 
queries on-line by accessing the database. 

We say that a similarity search algorithm is scalable if the 
following properties hold: 


• Time: The index database is precomputed within the
time of a sorting operation, up to a constant factor. To
serve a query the index database can only be accessed a
constant number of times.


• Memory: The algorithms run in external memory [24]:
the available main memory is constant, so it can be
arbitrarily smaller than the size of the web graph.


• Parallelization: Both precomputation and queries can
be implemented to utilize the computing power and
storage capacity of tens to thousands of servers
interconnected with a fast local network.


Observe that the time constraint implies that the index 
database cannot be too large. In fact our databases will be 
linear in the number V of vertices (pages). 

The memory requirements do not allow random access to 
the web graph. We will first sort the edges by their ending 
vertices using external memory sorting. Later we will read 
the entire set of edges sequentially as a stream, and repeat 
this process a constant number of times. 


1.3 Preliminaries about SimRank 


SimRank was introduced by Jeh and Widom [20] to for- 
malize the intuition that 


“two pages are similar if they are referenced by
similar pages.” 


The recursive SimRank iteration propagates similarity scores
with a constant decay factor $c \in (0,1)$ for vertices $u \neq v$:

\[
\mathrm{sim}_{\ell+1}(u,v) = \frac{c}{|I(u)|\,|I(v)|} \sum_{u' \in I(u)} \sum_{v' \in I(v)} \mathrm{sim}_{\ell}(u',v'),
\]


where I(x) denotes the set of vertices linking to x; if I(u)
or I(v) is empty, then $\mathrm{sim}_{\ell+1}(u,v) = 0$ by definition. For
a vertex pair with u = v we simply let $\mathrm{sim}_{\ell+1}(u,v) = 1$.
The SimRank iteration starts with $\mathrm{sim}_0(u,v) = 1$ for u = v
and $\mathrm{sim}_0(u,v) = 0$ otherwise. The SimRank score is defined
as the limit $\lim_{\ell \to \infty} \mathrm{sim}_{\ell}(u,v)$; see [20] for the proof of
convergence. Throughout this paper we refer to $\mathrm{sim}_{\ell}(u,v)$ as a
SimRank score, and regard $\ell$ as a parameter of SimRank.

The SimRank algorithm of [20] calculates the scores by 
iterating over all pairs of web pages, thus each iteration
requires $\Theta(V^2)$ time and memory, where V denotes the
number of pages. Thus the algorithm does not meet the
scalability requirements because of its quadratic running time
and random access to the web graph.
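
To make the quadratic cost concrete, the iteration can be sketched in a few lines of Python. This is only an illustrative in-memory sketch, not the implementation of [20]; the dictionary `in_nbrs` (vertex to list of in-neighbors) and the function names are assumptions introduced here.

```python
from itertools import product

def simrank_iteration(in_nbrs, sim, c):
    """One SimRank iteration over all vertex pairs (quadratic in V)."""
    nxt = {}
    for u, v in product(in_nbrs, repeat=2):
        if u == v:
            nxt[u, v] = 1.0
        elif not in_nbrs[u] or not in_nbrs[v]:
            nxt[u, v] = 0.0          # empty in-neighborhood: score 0 by definition
        else:
            total = sum(sim[a, b] for a in in_nbrs[u] for b in in_nbrs[v])
            nxt[u, v] = c * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
    return nxt

def simrank(in_nbrs, c, steps):
    """Start from sim_0 (identity) and apply `steps` iterations."""
    sim = {(u, v): 1.0 if u == v else 0.0
           for u, v in product(in_nbrs, repeat=2)}
    for _ in range(steps):
        sim = simrank_iteration(in_nbrs, sim, c)
    return sim

# Hypothetical toy graph: pages 1 and 2 are both linked from page 0.
in_nbrs = {0: [], 1: [0], 2: [0]}
scores = simrank(in_nbrs, c=0.5, steps=3)
```

The table `sim` holds one entry per vertex pair, which is exactly the quadratic storage that rules this approach out for web-scale graphs.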

We recall two generalizations of SimRank from [20], as 
we will exploit these results frequently. SimRank framework 
refers to the natural generalization that replaces the aver- 
age function in SimRank iteration by an arbitrary function 
of the similarity scores of pairs of in-neighbors. Obviously, 
the convergence does not hold for all the algorithms in the 
framework, but still $\mathrm{sim}_{\ell}$ is a well-defined similarity ranking.
Several variants are introduced in [20] for different purposes. 

For the second generalization of SimRank, suppose that a 
random walk starts from each vertex and follows the links 
backwards. Let $T_{u,v}$ denote the random variable equal to
the first meeting time of the walks starting from u and v;
$T_{u,v} = \infty$ if they never meet, and $T_{u,v} = 0$ if u = v.
In addition, let f be an arbitrary function that maps the
meeting times $0, 1, \ldots, \infty$ to similarity scores.


Definition 1. The expected f-meeting distance for vertices
u and v is defined as $E(f(T_{u,v}))$.


The above definition is adapted from [20] apart from the 
generalization that we do not assume uniform, independent 
walks of infinite length. In our case the walks may be pair- 
wise independent, correlated, finite or infinite. For example, 
we will introduce PSimRank as an expected f-meeting dis- 
tance of pairwise coupled random walks in Section 2.2. 

The following theorem justifies the expected f-meeting 
distance as a generalization of SimRank. It claims that
SimRank is equal to the expected f-meeting distance with
uniform independent walks and $f(t) = c^t$, where c denotes the
decay factor of SimRank with 0 < c < 1.


THEOREM 1. For a uniform, pairwise independent set of
reversed random walks of length $\ell$, the equality
$E(c^{T_{u,v}}) = \mathrm{sim}_{\ell}(u,v)$ holds, whether $\ell$ is finite or not.


The proof is published in [20] for the infinite case, and it 
can be easily extended to the finite case. 
Algorithm 1 Indexing (naive method) and similarity query

N = number of fingerprints, $\ell$ = path length, c = decay factor.
Indexing: Uses random access to the graph.
1: for i := 1 to N do
2:   for every vertex j of the web graph do
3:     Fingerprint[i][j] := random reversed path of
       length $\ell$ starting from j.
Query sim(u,v):
1: sim := 0
2: for i := 1 to N do
3:   Let k be the smallest offset with
     Fingerprint[i][u][k] = Fingerprint[i][v][k]
4:   if such k exists then
5:     sim := sim + $c^k$
6: return sim/N


2. MONTE CARLO SIMILARITY SEARCH ALGORITHMS


In this section we give the first scalable algorithm to
approximate SimRank scores. In addition, we introduce new
similarity functions accompanied by scalable algorithms:
PSimRank and the extended Jaccard coefficient.

All the algorithms fit into the framework of Monte Carlo 
similarity search algorithms that will be introduced through 
the example of SimRank. Recall that Theorem 1 expressed 
SimRank as the expected value $\mathrm{sim}_{\ell}(u,v) = E(c^{T_{u,v}})$ for
vertices u, v. Our algorithms generate reversed random walks,
calculate the first meeting time $T_{u,v}$ and estimate $\mathrm{sim}_{\ell}(u,v)$
by $c^{T_{u,v}}$. To improve the precision of approximation, the
sampling process is repeated N times and the independent 
samples are averaged. The computation is shared between 
indexing and querying as shown in Algorithm 1, a naive 
implementation. During the precomputation phase we gen- 
erate and store N independent reversed random walks of 
length $\ell$ for each vertex, and the first meeting time $T_{u,v}$ is
calculated at query time by reading the random walks from 
the precomputed index database. 
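
The naive method of Algorithm 1 can be illustrated with the following Python sketch. It keeps everything in main memory and is therefore not scalable; the names `in_nbrs`, `build_index` and `sim_query` are hypothetical, and ending a walk early at a vertex without in-links is an assumption made here for the sketch.

```python
import random

def random_reversed_path(in_nbrs, start, length, rng):
    """Walk up to `length` steps backwards along links.

    Assumption: the walk simply ends at a vertex with no in-links.
    """
    path = [start]
    for _ in range(length):
        nbrs = in_nbrs[path[-1]]
        if not nbrs:
            break
        path.append(rng.choice(nbrs))
    return path

def build_index(in_nbrs, n_fingerprints, length, seed=0):
    """Indexing: N independent reversed paths for every vertex."""
    rng = random.Random(seed)
    return [{j: random_reversed_path(in_nbrs, j, length, rng) for j in in_nbrs}
            for _ in range(n_fingerprints)]

def sim_query(index, u, v, c):
    """Query: average c**k over the fingerprints, k = first meeting offset."""
    total = 0.0
    for fingerprints in index:
        pu, pv = fingerprints[u], fingerprints[v]
        k = next((k for k, (a, b) in enumerate(zip(pu, pv)) if a == b), None)
        if k is not None:
            total += c ** k
    return total / len(index)

# Hypothetical toy graph: pages 1 and 2 are both linked from page 0.
toy = {0: [], 1: [0], 2: [0]}
index = build_index(toy, n_fingerprints=10, length=5)
estimate = sim_query(index, 1, 2, c=0.5)
```

On this toy graph every walk is forced, so the estimate coincides with the exact score; on real graphs the estimate fluctuates around the exact value and tightens as N grows.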

The main concept of Monte Carlo similarity search
already arises in this example. In general, a fingerprint refers
to a random object (a random walk in the example of
SimRank) associated with a node in such a way that the
expected similarity of a pair of fingerprints is the similarity
of their nodes. The Monte Carlo method precomputes and
stores fingerprints in an index database and estimates simi- 
larity scores at query time by averaging. The main difficul- 
ties of this framework are as follows: 


• During indexing (generating the fingerprints) we have to
meet the scalability requirements of Section 1.2. For 
example, generating the random walks with the naive 
indexing algorithm requires random access to the web 
graph, thus we need to store all the links in main mem- 
ory. To avoid this, we will first introduce algorithms 
utilizing O(V) main memory and then algorithms using 
memory of constant size, where V denotes the number of 
vertices. These computational requirements are referred 
to as semi-external memory and external memory mod- 
els [24], respectively. The parallelization techniques will 
be discussed in Section 3. 


• To achieve a reasonably sized index database, we need a
compact representation of the fingerprints. In the case of
the previous example, the index database (including an
inverted index for related queries) is of size $2 \cdot V \cdot N \cdot \ell$. In
practical examples we have $V \approx 10^9$ vertices and N = 100
fingerprints of length $\ell = 10$, thus the database is in total
8000 gigabytes. We will show a compact representation
that allows us to encode the fingerprints in $2 \cdot V \cdot N$ cells,
resulting in an index database with a size of 800 GB.


• We need efficient algorithms for evaluating queries. For
queries the main idea is that the similarity matrix is
sparse: for a page u there are relatively few other pages
that have non-negligible similarity to u. We will give al- 
gorithms that enumerate these pages in time proportional 
to their number. 
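
The two database sizes quoted in the list above can be checked with a short calculation. The sketch assumes a billion vertices and 4 bytes per stored cell; the cell width is not stated in the text and is chosen here because it reproduces the quoted figures.

```python
V = 10**9            # number of vertices
N = 100              # fingerprints per vertex
PATH_LEN = 10        # path length
BYTES_PER_CELL = 4   # assumption: one 32-bit vertex identifier per cell

# Naive representation: every walk is stored in full, plus the inverted index.
naive_bytes = 2 * V * N * PATH_LEN * BYTES_PER_CELL
# Compact representation: 2 * V * N cells in total.
compact_bytes = 2 * V * N * BYTES_PER_CELL

naive_gb = naive_bytes / 10**9
compact_gb = compact_bytes / 10**9
```

Under these assumptions the compact representation saves exactly a factor equal to the path length.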


2.1 SimRank 


The main idea of this section is that we do not generate 
totally independent sets of reversed random walks as in Al- 
gorithm 1. Instead, we generate a set of coalescing walks: 
each pair of walks will follow the same path after their first 
meeting time. (This coupling is commonly used in the the- 
ory of random walks.) More precisely, we start a reversed 
walk from each vertex. In each time step, the walks at dif- 
ferent vertices step independently to an in-neighbor chosen 
uniformly. If two walks are at the same vertex, they follow 
the same edge. 

Notice that we can still estimate $\mathrm{sim}_{\ell}(u,v) = E(c^{T_{u,v}})$
from the first meeting time $T_{u,v}$ of coalescing walks, since
any pair of walks is independent until they first meet. We
will show that the meeting times of coalescing walks can 
be represented in a surprisingly compact way by storing 
only one integer for each vertex instead of storing walks 
of length $\ell$. In addition, coalescing walks can be generated
more efficiently by the algorithm discussed in Section 2.1.3 
than totally independent walks. 


2.1.1 Fingerprint trees 


A set of coalescing reversed random walks can be repre- 
sented in a compact and efficient way. The main idea is that 
we do not need to reconstruct the actual paths as long as we 
can reconstruct the first meeting times for each pair of them. 
To encode this, we define the fingerprint graph (FPG) for a 
given set of coalescing random walks as follows. 

The vertices of FPG correspond to the vertices of the web 
graph indexed by 1,2,...,V. For each vertex u, we add a 
directed edge (u,v) to the FPG for at most one vertex v 
with 


(1) v < u and the fingerprints of u and v first meet at time
$T_{u,v} < \infty$;


(2) among the vertices satisfying (1), vertex v has the earliest
meeting time $T_{u,v}$;


(3) and given (1-2), the index of v is minimal.


Furthermore we label the edge (u, v) with $T_{u,v}$. An example
of a fingerprint graph is shown in Fig. 1.

The most important property of the compact FPG
representation is that it still allows us to reconstruct the $T_{u,v}$ values
with the following algorithm. For a pair of nodes u and v
consider the unique paths in the FPG starting from u and
v. If these paths have no vertex in common, then $T_{u,v} = \infty$.
Otherwise take the paths until the first common node w; let
$t_1$ and $t_2$ denote the labels of the edges on the paths
pointing to w; and let $t_1 = 0$ (or $t_2 = 0$) if u = w (or v = w).
Then $T_{u,v} = \max\{t_1, t_2\}$. (See the example of Fig. 1.) The
correctness of this algorithm with further properties of the
FPG is summarized by the following lemma.


Figure 1: Representing the first meeting times of
coalescing reversed walks of $u_1, u_2, u_3, u_4$ and $u_5$
(above) with a fingerprint graph (below). For
example, the fingerprints of $u_2$ and $u_5$ first meet at
time $T_{u_2,u_5} = \max\{3, 4\} = 4$.


LEMMA 2. Consider the fingerprint graph for a set of coa- 
lescing random walks. This graph is a directed acyclic graph, 
each node has out-degree at most 1, thus it is a forest of 
rooted trees with edges directed towards the roots. 

Consider the unique path in the fingerprint graph starting 
from vertex u. The indices of nodes it visits are strictly 
decreasing, and the labels on the edges are strictly increasing. 

Any first meeting time $T_{u,v}$ can be determined by
$T_{u,v} = \max\{t_1, t_2\}$ as detailed above.


By the lemma, the fingerprint graph is a collection of 
rooted trees referred to as fingerprint trees. The main obser- 
vation for storage and query is that the partition of nodes 
into trees preserves the locality of the similarity function. 
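
The reconstruction of first meeting times from a fingerprint graph can be sketched as follows. The representation is an assumption made for illustration: `parent[x] = (y, t)` encodes the FPG edge from x to y labelled t, and roots have no entry. The toy graph is hypothetical and is not the graph of Fig. 1.

```python
import math

def first_meeting_time(parent, u, v):
    """Recover the first meeting time of u and v from one fingerprint graph."""
    # For every node on u's path, remember the label of the edge entering it
    # (0 for u itself, matching the convention t1 = 0 when u = w).
    entered = {u: 0}
    x = u
    while x in parent:
        x, t = parent[x]
        entered[x] = t
    # Follow v's path until it reaches a node on u's path.
    x, t2 = v, 0
    while x not in entered:
        if x not in parent:
            return math.inf      # no common node: the walks never meet
        x, t2 = parent[x]
    return max(entered[x], t2)

# Hypothetical toy FPG: indices decrease and labels increase along each path.
parent = {5: (3, 1), 3: (1, 4), 2: (1, 3)}   # vertex 4 is an isolated root
```

Both loops follow a single root-directed path, so the query time is bounded by the path lengths, which by Lemma 2 cannot exceed the walk length.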


2.1.2 Fingerprint database and query 


The first advantage of the fingerprint graph (FPG) is that 
it represents all first meeting times for a set of coalescing 
walks of length $\ell$ in a compact manner. It is compact, since
every vertex has at most one out-edge in an FPG, so the size
of one graph is V, and $N \cdot V$ bounds the total size. This is
a significant improvement over the naive representation of
the walks with a size of $N \cdot V \cdot \ell$.

The second important property of the fingerprint graph 
is that two vertices have non-zero estimated similarity iff 
they fall into the same component (the same fingerprint 
tree). Thus, when serving a related(u) query it is enough to 
read and traverse from each of the N fingerprint graphs the 
unique tree containing u. Therefore in a fingerprint data- 
base, we store the fingerprint graphs ordered as a collection 
of fingerprint trees, and for each vertex u we also store the 
identifiers of the N trees containing u. By adding the
identifiers the total size of the database is no more than $2 \cdot N \cdot V$.

A related(u) query requires N + 1 accesses to the
fingerprint database: one for the tree identifiers and then N more
for the fingerprint trees of u. A sim(u,v) query accesses
the fingerprint database at most N + 2 times, by loading
two lists of identifiers and then the trees containing both u
and v. For both types of queries the trees can be traversed
in time linear in the size of the tree.


1 To be more precise we need $V(\lceil \log(V) \rceil + \lceil \log(\ell) \rceil)$ bits for an FPG
to store the labelled edges. Notice that the weights require no more
than $\lceil \log(\ell) \rceil = 4$ bits for each vertex for the typical value of $\ell = 10$.


Algorithm 2 Indexing (using $2 \cdot V$ main memory)

N = number of fingerprints, $\ell$ = length of paths. Uses
subroutine GenRndInEdges that generates a random in-edge for
each vertex in the graph and stores its source in an array.
1: for i := 1 to N do
2:   for every vertex j of the web graph do
3:     PathEnd[j] := j /* start a path from j */
4:   for k := 1 to $\ell$ do
5:     NextIn[] := GenRndInEdges();
6:     for every vertex j with PathEnd[j] ≠ “stopped” do
7:       PathEnd[j] := NextIn[PathEnd[j]] /* extend the path */
8:     SaveNewFPGEdges(PathEnd)
9:   Collect edges into trees and save as $FPG_i$.

Notice that the query algorithms do not meet all the scal- 
ability requirements: although the number of database ac- 
cesses is constant (at most N+2), the memory requirement 
for storing and traversing one fingerprint tree may be as 
large as the number of pages V. Thus, theoretically the 
algorithm may use as much as V memory. 

Fortunately, in the case of web data the algorithm performs
as an external memory algorithm. As verified by our
numerical experiments on 80M pages (see Section 5.3), the
average sizes of fingerprint trees are approximately 100-200
for reasonable path lengths. Even the largest trees in our
database had at most 10K-20K vertices, thus 50 Kbytes of
data need to be read for each database access in the worst case.


2.1.3 Building the fingerprint database 


It remains to present a scalable algorithm to generate co- 
alescing sets of walks and compute the fingerprint graphs. 

As opposed to the naive algorithm generating the fin- 
gerprints one-by-one, we generate all fingerprints together. 
With one iteration we extend all partially generated
fingerprints by one edge. To achieve this, we generate one uniform
in-edge $e_j$ for each vertex j independently. Then we extend
with edge $e_j$ each of those fingerprints that have the same
last node j. This method generates a coalescing set of walks,
since a pair of walks will be extended with the same edge 
after they first meet. Furthermore, they are independent 
until the first meeting time. 

The pseudo-code is displayed as Algorithm 2, where
NextIn[j] stores the starting vertex of the randomly chosen edge
$e_j$, and PathEnd[j] is the ending vertex of the partial
fingerprint that started from j. To be more precise, if a group
of walks already met, then PathEnd[j] = “stopped” for every
member j of the group except for the smallest j. The
SaveNewFPGEdges subroutine detects if a group of walks meets
in the current iteration, saves the fingerprint tree edges
corresponding to the meetings and sets PathEnd[j] = “stopped”
for all non-minimal members j of the group.
SaveNewFPGEdges detects new meetings by a linear time counting sort
of the non-stopped elements of the PathEnd array.
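
The indexing step described above can be sketched in Python for a single fingerprint graph. This is an in-memory illustration under assumptions made here: `in_nbrs` maps each vertex to its in-neighbor list, a dictionary grouping stands in for the counting sort, and a walk reaching a vertex without in-links is treated as stopped.

```python
import random

STOPPED = object()   # marks a walk that merged into another one or hit a dead end

def build_fpg(in_nbrs, length, rng):
    """Build one fingerprint graph from coalescing reversed walks.

    Returns parent[j] = (i, t): walk j first met the smaller-indexed walk i
    at time t.
    """
    path_end = {j: j for j in in_nbrs}           # start a path from every vertex
    parent = {}
    for t in range(1, length + 1):
        # GenRndInEdges: one uniform in-edge per vertex.
        next_in = {x: (rng.choice(nb) if nb else STOPPED)
                   for x, nb in in_nbrs.items()}
        for j in path_end:
            if path_end[j] is not STOPPED:
                path_end[j] = next_in[path_end[j]]   # extend the path
        # SaveNewFPGEdges: group the live walks by their current end vertex.
        groups = {}
        for j in sorted(path_end):
            if path_end[j] is not STOPPED:
                groups.setdefault(path_end[j], []).append(j)
        for members in groups.values():
            for j in members[1:]:                    # all but the smallest index
                parent[j] = (members[0], t)
                path_end[j] = STOPPED
    return parent

# Hypothetical toy graph with forced choices: 0 -> 1, 0 -> 2, 1 -> 3.
fpg = build_fpg({0: [], 1: [0], 2: [0], 3: [1]}, length=3, rng=random.Random(0))
```

Because all walks standing at the same vertex read the same entry of `next_in`, they are extended with the same edge, which is what makes the generated set coalescing.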

The subroutine GenRndInEdges may generate a set of
random in-edges with a simple external memory algorithm if
the edges are sorted by the ending vertices. Notice that a
significant improvement can be achieved by generating and
saving all the required random edge-sets together during a
single scan over the edges of the web graph. Thus, all the
$N \cdot \ell$ edge-scans can be replaced by one edge-scan saving
many sets of in-edges. Then GenRndInEdges sequentially
reads the $N \cdot \ell$ arrays of size V from disk.


Figure 2: When SimRank fails: pages u and v have
k witnesses for similarity, yet their SimRank score
is smaller than $\frac{1}{k}$.
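
One possible way to draw a uniform in-edge per vertex during a sequential scan is reservoir sampling, sketched below. The text does not name the sampling technique, so this choice is an assumption; the sketch expects an iterable of (source, target) pairs sorted by target and produces one edge set.

```python
import random

def gen_rnd_in_edges(edge_stream, rng):
    """Yield (vertex, random in-neighbor) from edges sorted by target.

    Only constant state is kept per open group: the k-th edge of a group
    replaces the current choice with probability 1/k.
    """
    current, choice, seen = None, None, 0
    for source, target in edge_stream:
        if target != current:
            if current is not None:
                yield current, choice
            current, seen = target, 0
        seen += 1
        if rng.randrange(seen) == 0:
            choice = source
    if current is not None:
        yield current, choice

# Hypothetical edge list, already sorted by the ending vertex.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
chosen = dict(gen_rnd_in_edges(edges, random.Random(0)))
```

Keeping several independent `choice` variables per group would produce many edge sets in the same scan, in the spirit of the single edge-scan described above.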

The algorithm outlined above fits into the semi-external
memory model, since it utilizes $2 \cdot V$ main memory to store
the PathEnd and NextIn arrays. (The counting sort operation
of SaveNewFPGEdges may reuse the NextIn array, so it does not
require additional storage capacity.) The algorithm can be
easily converted into the external memory model by keeping
the PathEnd and NextIn arrays on the disk and by replacing
Lines 6-8 of Algorithm 2 with external sorting and merging
processes. Furthermore, at the end of the indexing the
individual fingerprint trees can be collected with $\ell$ sorting
and merging operations, as the longest possible path in each
fingerprint tree is $\ell$ (by Lemma 2 the labels are strictly
increasing but cannot grow over $\ell$).


2.2 PSimRank 


In this section we give a new SimRank variant with prop- 
erties extending those of Minimax SimRank [20], a non- 
scalable algorithm that cannot be formulated in our frame- 
work. The new similarity function will be expressed as an 
expected f-meeting distance by modifying the distribution 
of the set of random walks and by keeping $f(t) = c^t$.

A deficiency of SimRank can be best viewed by an exam- 
ple. Consider two very popular web portals. Many users link 
to both pages on their personal websites, but these pages 
are not reported to be similar by SimRank. An extreme 
case is depicted in Fig. 2 with portals u and v having the
same in-neighborhood of size k. Though the k pages are
totally dissimilar in the link-based sense, we would still
intuitively regard u and v as similar. Unfortunately SimRank
is counter-intuitive in this case, as $\mathrm{sim}_{\ell}(u,v) = c \cdot \frac{1}{k}$
converges to zero with the number k of common in-neighbors.
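
The value of the score in this extreme case follows directly from the SimRank iteration. As a worked step, assuming that distinct witnesses have similarity zero with each other, only the k pairs with $u' = v'$ contribute to the double sum:

```latex
\[
\mathrm{sim}_{\ell}(u,v)
  = \frac{c}{|I(u)|\,|I(v)|} \sum_{u' \in I(u)} \sum_{v' \in I(v)} \mathrm{sim}_{\ell-1}(u',v')
  = \frac{c}{k^2} \cdot k
  = \frac{c}{k}.
\]
```

Since $c < 1$, the score stays below $\frac{1}{k}$ no matter how many common witnesses the two portals share.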


2.2.1 Coupled random walks 


We define PSimRank as the expected f-meeting distance 
of a set of random walks, which are not independent, as in 
case of SimRank, but are coupled so that a pair of them can 
find each other more easily. 

We solve the deficiency of SimRank by allowing the
random walks to meet with higher probability when they are
close to each other: a pair of random walks at vertices $u'$, $v'$
will advance to the same vertex (i.e., meet in one step) with
probability equal to the Jaccard coefficient
$\frac{|I(u') \cap I(v')|}{|I(u') \cup I(v')|}$ of their
in-neighborhoods $I(u')$ and $I(v')$.


Definition 2. PSimRank is the expected f-meeting
distance with $f(t) = c^t$ (for some 0 < c < 1) of the following
set of random walks. For each vertex u, the random walk $X_u$
makes $\ell$ uniform independent steps on the transposed web
graph starting from point u. For each pair of vertices u, v
and time t, assume that the random walks are at position
$X_u(t) = u'$ and $X_v(t) = v'$. Then


• with probability $\frac{|I(u') \cap I(v')|}{|I(u') \cup I(v')|}$ they both step to the same
uniformly chosen vertex of $I(u') \cap I(v')$;


• with probability $\frac{|I(u') \setminus I(v')|}{|I(u') \cup I(v')|}$ the walk $X_u$ steps to a
uniform vertex in $I(u') \setminus I(v')$ and the walk $X_v$ steps to an
independently chosen uniform vertex in $I(v')$;


• with probability $\frac{|I(v') \setminus I(u')|}{|I(u') \cup I(v')|}$ the walk $X_v$ steps to a
uniform vertex in $I(v') \setminus I(u')$ and the walk $X_u$ steps to an
independently chosen uniform vertex in $I(u')$.


We give a set of random walks satisfying the coupling of
the definition. For each time $t \ge 0$ we choose an independent
random permutation $\sigma_t$ on the vertices of the web graph. At
time t if the random walk from vertex u is at $X_u(t) = u'$,
it will step to the in-neighbor with smallest index given by
the permutation $\sigma_t$, i.e.,

\[
X_u(t+1) = \operatorname*{argmin}_{u'' \in I(u')} \sigma_t(u'').
\]


It is easy to see that the random walk $X_u$ takes uniform
independent steps, since we have a new permutation for each
step. The above coupling is also satisfied, since for any
pair $u', v'$ the vertex $\operatorname{argmin}_{w \in I(u') \cup I(v')} \sigma_t(w)$ falls into the
sets $I(u') \cap I(v')$, $I(u') \setminus I(v')$, $I(v') \setminus I(u')$ with respective
probabilities

\[
\frac{|I(u') \cap I(v')|}{|I(u') \cup I(v')|}, \quad
\frac{|I(u') \setminus I(v')|}{|I(u') \cup I(v')|} \quad \text{and} \quad
\frac{|I(v') \setminus I(u')|}{|I(u') \cup I(v')|}.
\]
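
The claim that the permutation-based step makes two walks meet with probability equal to the Jaccard coefficient can be verified exhaustively on a small example. The sketch below enumerates all orderings of the union of two hypothetical in-neighborhoods; the function names are assumptions introduced here.

```python
from fractions import Fraction
from itertools import permutations

def coupled_step(in_u, in_v, sigma):
    """Both walks step to the in-neighbor with the smallest value under sigma."""
    return min(in_u, key=sigma.__getitem__), min(in_v, key=sigma.__getitem__)

def meeting_probability(in_u, in_v):
    """Exact one-step meeting probability over all orderings of the union."""
    universe = sorted(set(in_u) | set(in_v))
    met = total = 0
    for perm in permutations(range(len(universe))):
        sigma = dict(zip(universe, perm))
        a, b = coupled_step(in_u, in_v, sigma)
        met += (a == b)
        total += 1
    return Fraction(met, total)

# Hypothetical in-neighborhoods sharing 2 of the 4 vertices in their union.
p_meet = meeting_probability({1, 2, 3}, {2, 3, 4})
```

Restricting the enumeration to the union is enough, because only the relative order of these vertices decides which in-neighbor each walk selects.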


2.2.2 PSimRank in SimRank framework 


Now we prove that PSimRank is in the SimRank frame- 
work, i.e., the scores can be formulated by iterations that 
propagate similarities over the pairs of in-neighbors analo- 
gously to SimRank. The PSimRank-iterations provide an 
exact quadratic algorithm to compute PSimRank scores. 
Furthermore, the iterative formulation indicates that PSim- 
Rank scores are determined by Definition 2 and the values 
do not depend on the actual choice of the coupling. 

Let $T_{u,v}$ denote the first meeting time of the walks
$X_u$, $X_v$ starting from vertices u, v; and $T_{u,v} = \infty$ if the walks
never meet. Then PSimRank scores for path length $\ell$ can be
expressed by definition as $\mathrm{psim}_{\ell}(u,v) = E(c^{T_{u,v}})$. It is trivial
that $\mathrm{psim}_0(u,v) = 1$ if u = v, and otherwise $\mathrm{psim}_0(u,v) = 0$.
By applying the law of total expectation on the first step of
the walks $X_u$ and $X_v$, and a time shift, we get the following
PSimRank iterations:


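\[
\begin{aligned}
\mathrm{psim}_{\ell+1}(u,v) &= 1, \quad \text{if } u = v; \\
\mathrm{psim}_{\ell+1}(u,v) &= 0, \quad \text{if } I(u) = \emptyset \text{ or } I(v) = \emptyset; \\
\mathrm{psim}_{\ell+1}(u,v) &= c \cdot \Bigg[ \frac{|I(u) \cap I(v)|}{|I(u) \cup I(v)|} \cdot 1 \\
&\quad + \frac{|I(u) \setminus I(v)|}{|I(u) \cup I(v)|} \cdot \frac{1}{|I(u) \setminus I(v)| \cdot |I(v)|}
  \sum_{\substack{u' \in I(u) \setminus I(v) \\ v' \in I(v)}} \mathrm{psim}_{\ell}(u',v') \\
&\quad + \frac{|I(v) \setminus I(u)|}{|I(u) \cup I(v)|} \cdot \frac{1}{|I(v) \setminus I(u)| \cdot |I(u)|}
  \sum_{\substack{v' \in I(v) \setminus I(u) \\ u' \in I(u)}} \mathrm{psim}_{\ell}(u',v') \Bigg].
\end{aligned}
\]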
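
The PSimRank iteration can be written out as an exact, quadratic Python sketch. It is meant only to illustrate the three terms of the recursion; `in_nbrs` and the function name are assumptions, and the code relies on the scores being symmetric in their two arguments.

```python
from itertools import product

def psimrank_iteration(in_nbrs, psim, c):
    """One exact PSimRank iteration over all vertex pairs."""
    nxt = {}
    for u, v in product(in_nbrs, repeat=2):
        iu, iv = set(in_nbrs[u]), set(in_nbrs[v])
        if u == v:
            nxt[u, v] = 1.0
        elif not iu or not iv:
            nxt[u, v] = 0.0
        else:
            union = len(iu | iv)
            value = len(iu & iv) / union       # the walks meet in this step
            # Remaining two terms: one walk steps outside the other's
            # in-neighborhood, the other steps independently (psim is symmetric).
            for only, other in ((iu - iv, iv), (iv - iu, iu)):
                if only:
                    pairs = sum(psim[a, b] for a in only for b in other)
                    value += len(only) / union * pairs / (len(only) * len(other))
            nxt[u, v] = c * value
    return nxt

# Hypothetical instance in the spirit of Fig. 2: u and v share k = 3 witnesses.
graph = {"w1": [], "w2": [], "w3": [],
         "u": ["w1", "w2", "w3"], "v": ["w1", "w2", "w3"]}
psim0 = {(a, b): 1.0 if a == b else 0.0 for a, b in product(graph, repeat=2)}
psim1 = psimrank_iteration(graph, psim0, c=0.5)
```

On this instance the score of the two portals equals the decay factor and does not shrink as witnesses are added, in contrast to the SimRank value discussed in Section 2.2.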


2.2.3 Computing PSimRank


To achieve a scalable algorithm for PSimRank we mod- 
ify the SimRank indexing and query algorithms introduced 
in Section 2.1. The following result allows us to use the 
compact representation of fingerprint graphs. 


LEMMA 3. Any set of random walks satisfying the
PSimRank requirements is coalescing, i.e., any pair follows the
same path after their first meeting time.


To apply the indexing algorithm of SimRank, we only
need to ensure the pairwise coupling. This can be accomplished
by simply replacing the GenRndInEdges procedure.
Recall that for SimRank this procedure generated one independent,
uniform in-edge for each vertex $v$ in the graph. In
the case of PSimRank, GenRndInEdges chooses a permutation $\sigma$
at random; then for each vertex $v$ the in-neighbor with the
smallest index under the permutation $\sigma$ is selected, i.e., vertex
$\arg\min_{v' \in I(v)} \sigma(v')$ is chosen.

As in the case of the GenRndInEdges for SimRank, all the 
required sets of random in-edges can be generated within a 
single scan over the edges of the web graph, if the edges are 
sorted by the ending vertices. The random permutations can 
be stored in small space by random linear transformations 
as in [6]. With this method the external memory implemen- 
tation of SimRank can be extended to PSimRank. 
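A minimal sketch of the PSimRank variant of GenRndInEdges follows. It is not the paper's implementation: the function name is hypothetical, and the permutation is simulated by lazily drawn random keys held in a dictionary, which is simpler than (but uses more memory than) the random linear transformations of [6].

```python
import random

def gen_rnd_in_edges_psimrank(edges, rng):
    """One PSimRank step: for every vertex pick the in-neighbor that is
    smallest under a fresh random permutation sigma.

    edges: iterable of (source, target) pairs, read in a single pass.
    rng:   a random.Random instance.
    Returns a dict mapping each target to its chosen in-neighbor.
    """
    rank = {}  # sigma(w), drawn on first use (assumed stand-in for [6])
    best = {}  # current argmin over the in-neighbors seen so far
    for src, dst in edges:
        if src not in rank:
            rank[src] = rng.random()
        if dst not in best or rank[src] < rank[best[dst]]:
            best[dst] = src
    return best
```

Because one permutation is shared by all vertices in a step, two vertices with identical in-neighbor sets always select the same in-neighbor, which is the coupling that makes the walks meet more often than independent SimRank walks.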


2.3 Extended Jaccard coefficient 


In this section we formally define the extended Jaccard 
coefficient, and give efficient (Monte Carlo) approximation 
algorithms in the indexing-query model by applying min- 
hashing [5, 7], the well-known fingerprinting technique for 
estimating Jaccard coefficient between arbitrary sets. The 
main contribution of this section is that we give semi-external 
memory, external memory and distributed algorithms sim- 
ilar to PageRank iterations [27, 9] that compute the min- 
hash fingerprints for the multi-step neighborhoods of ver- 
tices. The proposed methods can be further parallelized 
using the methods described in Section 3. 

The extended Jaccard coefficient is defined as the expo- 
nentially weighted sum of the Jaccard coefficients of larger 
neighborhoods. 


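Definition 8. Let $I_k(v)$ be the $k$-in-neighborhood of $v$, i.e.,
the set of vertices from where vertex $v$ can be reached using
at most $k$ directed edges. The extended Jaccard coefficient,
XJaccard, for length $\ell$ of vertices $u$ and $v$ is defined as

$$\mathrm{xjac}_\ell(u,v) = \sum_{k=1}^{\ell} \frac{|I_k(u) \cap I_k(v)|}{|I_k(u) \cup I_k(v)|} \cdot c^k (1-c).$$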


We will use the following min-hash fingerprinting technique
for Jaccard coefficients [5, 7]: take a random permutation
$\sigma$ of the vertices and represent each set $I_k(v)$ with the
minimum value of this permutation over the set $I_k(v)$ as a
fingerprint. Then for each distance $k$ and vertices $u, v$ the
probability that these fingerprints match equals the Jaccard
coefficient $\frac{|I_k(u) \cap I_k(v)|}{|I_k(u) \cup I_k(v)|}$. We can use this for each $k = 1, \ldots, \ell$
to get an $\ell$ sized fingerprint of each vertex, from which the
extended Jaccard coefficients can be approximated for any
pair of vertices.

More precisely, we calculate the following fingerprint for
each vertex $v$ and each $k = 1, \ldots, \ell$:

$$\mathrm{fp}_k(v) = \min_{v' \in I_k(v)} \sigma(v')$$


Algorithm 3 Precomputing extended Jaccard coefficients
N = number of fingerprints, ℓ = length of fingerprints.
for i := 1 to N do
    generate a random permutation σ
    for every vertex j of the web graph do
        NFP[j] := σ(j)    /* start the fingerprint */
    for k := 1 to ℓ do
        FP[] := NFP[]
        for every edge (u,v) of the web graph do
            NFP[v] := min(NFP[v], FP[u])
        save array NFP[] as FP_k[]
    merge arrays FP_k, k = 1, ..., ℓ, and create inverted index




Then by taking these as random variables we get a probabilistic
formulation (note that we use the same random
permutation $\sigma$ for each step):

$$\mathrm{xjac}_\ell(u,v) = \sum_{k=1}^{\ell} \Pr\{\mathrm{fp}_k(u) = \mathrm{fp}_k(v)\} \cdot c^k (1-c).$$


Using this equivalence we can take $N$ independent samples
to generate $N$ sets of fingerprints. Upon a query $\mathrm{xjac}_\ell(u, v)$
we load all the fingerprints for $u$ and $v$, and average the results
over them to get an unbiased estimate of $\mathrm{xjac}_\ell(u, v)$. For
serving related queries we load the fingerprints of the queried
page and use standard inverted indexing techniques to find
all the pages that have matching parts in their fingerprints.

Serving XJaccard queries requires a database of size $2 \cdot V \cdot N \cdot \ell$,
a similarity query uses two database accesses, and a
related query uses up to $1 + N \cdot \ell$ database accesses. As we
will show in Section 5, the preferred length of fingerprints
is approximately $\ell = 4$ on the web graph, thus these figures
are still reasonable. Furthermore, the factor $N$ can be
eliminated by using $N$-way parallelization, as discussed in
Section 3.
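The query step can be illustrated with a short Python sketch. It assumes the exponential weights $c^k(1-c)$ in the definition of XJaccard, and that each of the $N$ fingerprints of a vertex is stored as a list of its $\ell$ min-hash values; the function name `xjaccard_estimate` and this storage layout are hypothetical.

```python
def xjaccard_estimate(fps_u, fps_v, c):
    """Estimate xjac_l(u, v) by averaging over N fingerprint samples.

    fps_u, fps_v: lists of N fingerprints; fingerprint i is the list
    [fp_1, ..., fp_l] computed with the i-th random permutation.
    """
    n = len(fps_u)
    total = 0.0
    for fp_u, fp_v in zip(fps_u, fps_v):
        # A match at distance k contributes the weight c^k (1 - c).
        for k, (a, b) in enumerate(zip(fp_u, fp_v), start=1):
            if a == b:
                total += (c ** k) * (1 - c)
    return total / n
```

Each sample is an unbiased estimate because the indicator of a match at distance k has expectation equal to the Jaccard coefficient of the k-in-neighborhoods.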


2.3.1 Precomputation of extended Jaccard coefficient 


We give a semi-external memory algorithm first. The key
observation is that we use the same permutation for generating
all steps of the fingerprint, which allows the following
recursion:

$$\mathrm{fp}_k(u) = \min_{u' \in I(u) \cup \{u\}} \mathrm{fp}_{k-1}(u')$$

Using this formula we can extend the fingerprints by one step
using one edge-scan and the fingerprints of the previous step
(see Algorithm 3).
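One sample of this precomputation can be sketched in Python as follows. The sketch keeps both vertex arrays in memory and reads the edges from a list, standing in for the sequential edge scan of the semi-external memory setting; the function name and the integer vertex numbering are assumptions made for illustration.

```python
import random

def xjaccard_fingerprints(num_vertices, edges, length, rng):
    """One sample of the fingerprint recursion for all vertices.

    Vertices are 0..num_vertices-1; edges is a list of (u, v) pairs.
    Returns a list fps with fps[k-1][v] == fp_k(v).
    """
    sigma = list(range(num_vertices))
    rng.shuffle(sigma)      # random permutation sigma
    nfp = sigma[:]          # fp_0(v) = sigma(v)
    fps = []
    for _ in range(length):
        fp = nfp[:]         # fingerprints of the previous step
        for u, v in edges:  # one scan over the edge list
            nfp[v] = min(nfp[v], fp[u])
        fps.append(nfp[:])
    return fps
```

Calling the function N times with independent permutations yields the N fingerprint sets; since the calls share no state, they can run on separate machines.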


2.3.2 External memory and distributed indexing 


Algorithm 3 for semi-external memory indexing of ex- 
tended Jaccard coefficients is very similar to the classic Page- 
Rank computing method using power-iteration: each itera- 
tion scans the entire edge-set and updates a vector (indexed 
by the vertices) using the vector computed by the previ- 
ous iteration. This allows us to adapt the external memory 
PageRank algorithms [9, 15] and the distributed indexing 
technique [14] designed for personalized PageRank. 

In total with $N = 100$ and $\ell = 4$ the precomputation
costs for extended Jaccard coefficients are thus similar to
the precomputation cost for $N \cdot \ell = 400$ PageRank iterations, with
one remarkable difference: while PageRank can only be computed
sequentially, the precomputation of extended Jaccard
coefficients can be parallelized up to $N$-way.
3. MONTE CARLO PARALLELIZATION


In this section we discuss the parallelization possibilities
of our methods. We show that all of them exhibit features
(such as fault tolerance, load balancing and dynamic adaptation
to workload) that make them well suited to
large-scale web search engines.

All similarity methods we have given in this paper are 
organized around the same concepts: 


• we compute a similarity measure by averaging $N$ independent
samples from a certain random variable;


• the independent samples are stored in $N$ instances of an
index database, each capable of producing a sample of
the random variable for any pair of vertices.


The above framework allows a straightforward paralleliza- 
tion of both the indexing and the query: the computation of 
independent index databases can be performed on up to N 
different machines. Then the databases are transferred to 
the backend computers that serve the query requests. When
a request arrives at the frontend server, it queries all (up to $N$)
backend servers, averages their answers and returns the results
to the user.

The Monte Carlo parallelization scheme has many ad- 
vantages that make it perfectly suitable to large-scale web 
search engines: 

Fault tolerance. If one or more backend servers cannot re- 
spond to the query in time, then the frontend can aggregate 
the results of the remaining ones and calculate the estimate 
from the available answers. This will not influence service 
availability, and results in a slight loss of precision. 

Load balancing. In case of very high query loads, more 
than N backend servers (database servers) can be employed. 
A simple solution is to replicate the individual index databases.
Better results are achieved if one calculates an independent
index database for each backend server. In this
case it suffices to ask any $N$ backend servers for a proper
precision answer. This allows seamless load balancing, i.e.,
more backend servers can be added one-by-one as the demand
increases.

Furthermore, this parallelization allows dynamic adapta- 
tion to workload. During times of excessive load the number 
of backend servers asked for each query (N) can be auto- 
matically reduced to maintain fast response times and thus 
service integrity. Meanwhile, during idle periods, this value 
can be increased to get higher precision for free (along with 
better utilization of resources). We believe that this feature 
is extremely important in the applicability of our results. 
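The frontend logic described in this section reduces to averaging whatever samples arrive in time. The Python sketch below is a hypothetical illustration: the function name `aggregate`, the use of `None` for a backend that failed or timed out, and the `min_answers` parameter are assumptions, not part of the paper.

```python
def aggregate(answers, min_answers=1):
    """Average the similarity samples returned by the backend servers.

    answers: one entry per backend queried; None marks a backend that
    failed or timed out. Returns (estimate, number_of_samples_used).
    """
    got = [a for a in answers if a is not None]
    if len(got) < min_answers:
        raise RuntimeError("too few backend answers")
    return sum(got) / len(got), len(got)
```

Dropping a sample leaves the estimate unbiased and only widens its error, which is why missing backends cost precision rather than availability. Reducing the number of backends asked under heavy load uses the same code path.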


4. ERROR OF APPROXIMATION 


As we have seen in earlier sections, a crucial parameter 
of our methods is the number N of fingerprints. The index 
database size, indexing time, query time and database ac- 
cesses are all linear in N. In this section we formally analyze 
the number of fingerprints needed for a given precision ap- 
proximation. Our theorems show that even a modest num- 
ber of fingerprints (e.g., N = 100) suffices for the purposes 
of a web search engine. 

To state our results we need a suitably general model that
can accommodate our methods for SimRank, PSimRank and
XJaccard. Suppose that a Monte Carlo algorithm assigns $N$
independent sets of fingerprints for the vertices and for any
pair $u, v$ the similarity function $\mathrm{sim}(u, v)$ equals the expected
value of the similarities of the fingerprints. The similarities
of the fingerprints are calculated by a function that maps
any pair of fingerprints to a similarity score in range $[0,1]$.
The similarity function estimated by averaging the similarities
of $N$ sets of fingerprints will be referred to as a Monte
Carlo similarity function and it will be denoted by $\widehat{\mathrm{sim}}(\cdot,\cdot)$.
Naturally, $\mathbb{E}(\widehat{\mathrm{sim}}(u,v)) = \mathrm{sim}(u,v)$ holds. Notice that the
approximate scores of our algorithms for SimRank, PSimRank
and XJaccard can all be regarded as Monte
Carlo similarity functions $\widehat{\mathrm{sim}}(\cdot,\cdot)$. A more general model is defined
in the full version [13] of this paper.


THEOREM 4. For any Monte Carlo similarity function
$\widehat{\mathrm{sim}}$ the absolute error converges to zero exponentially in the
number of fingerprints $N$ and uniformly over the pair of vertices
$u, v$. More precisely, for any vertices $u, v$ and any $\delta > 0$
we have

$$\Pr\{|\widehat{\mathrm{sim}}(u, v) - \mathrm{sim}(u, v)| > \delta\} < 2e^{-\frac{6}{7} N \delta^2}.$$

Notice that the bound uniformly applies to all graphs and 
all similarity functions, such as SimRank, PSimRank and 
XJaccard. However, this bound concerns the convergence of 
the similarity score for one pair of vertices only. In the web 
search scenario, we typically use related queries, thus are 
interested in the relative order of pages according to their 
similarity to a given query page u. 
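To get a feel for the numbers, the bound of Theorem 4 can be evaluated directly. The sketch below assumes the bound reads $2e^{-\frac{6}{7}N\delta^2}$; the function names are hypothetical.

```python
import math

def abs_error_bound(n, delta):
    """Theorem 4 bound on Pr{|estimate - sim| > delta} with n fingerprints."""
    return 2.0 * math.exp(-6.0 / 7.0 * n * delta ** 2)

def fingerprints_needed(delta, fail_prob):
    """Smallest n for which the Theorem 4 bound is at most fail_prob."""
    return math.ceil(7.0 / 6.0 * math.log(2.0 / fail_prob) / delta ** 2)
```

With $N = 100$ and $\delta = 0.2$ the bound is roughly 0.065, so an error above 0.2 is unlikely but fine distinctions are not resolved. Halving $\delta$ multiplies the required number of fingerprints by four.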


THEOREM 5. For any Monte Carlo similarity function
$\widehat{\mathrm{sim}}$ and any fixed item $u$, the probability of interchanging two
items in the similarity ranking of page $u$ converges to zero exponentially
in the number of fingerprints $N$. More precisely,
for each page $v$ and $w$, such that $\mathrm{sim}(u,v) > \mathrm{sim}(u,w)$ we
have

$$\Pr\{\widehat{\mathrm{sim}}(u,v) < \widehat{\mathrm{sim}}(u, w)\} < e^{-0.3 N \delta^2}$$

where $\delta = \mathrm{sim}(u,v) - \mathrm{sim}(u, w)$.


This theorem implies that the Monte Carlo approxima- 
tion can efficiently capture the big differences among the 
similarity scores. But when it comes to small differences, 
then the error of approximation obscures the actual similar- 
ity ranking, and an almost arbitrary reordering is possible. 
We believe that for a web search inspired similarity ranking
it is sufficient to distinguish between very similar, modestly
similar, and dissimilar pages. We can formulate this requirement
in terms of a slightly weakened version of the classical
information retrieval measures precision and recall [1].

Consider a related query for page $u$ with similarity threshold
$\alpha$, i.e., the problem is to return the set of pages $S = \{v :
\mathrm{sim}(u,v) > \alpha\}$. Our methods approximate this set with $\hat{S} =
\{v : \widehat{\mathrm{sim}}(u, v) > \alpha\}$. We weaken the notion of precision and
recall to exclude a small, $\delta$ sized interval of similarity scores
around the threshold $\alpha$: let $S_{+\delta} = \{v : \mathrm{sim}(u,v) > \alpha + \delta\}$,
$S_{-\delta} = \{v : \mathrm{sim}(u,v) > \alpha - \delta\}$. Then the expected $\delta$-recall
of a Monte Carlo similarity function is $\frac{\mathbb{E}(|\hat{S} \cap S_{+\delta}|)}{|S_{+\delta}|}$, while the
expected $\delta$-precision is $\frac{\mathbb{E}(|\hat{S} \cap S_{-\delta}|)}{\mathbb{E}(|\hat{S}|)}$. We denote by $\bar{S}_{-\delta}$ the
complement set of $S_{-\delta}$.


THEOREM 6. For any Monte Carlo similarity function
$\widehat{\mathrm{sim}}$, any page $u$, similarity threshold $\alpha$ and $\delta > 0$ the
expected $\delta$-recall is at least

$$1 - e^{-\frac{6}{7} N \delta^2}$$

and the expected $\delta$-precision is at least

$$1 - \frac{|\bar{S}_{-\delta}|}{|S_{+\delta}|} \cdot \frac{1}{e^{\frac{6}{7} N \delta^2} - 1}.$$


This theorem shows that the expected $\delta$-recall converges
to 1 exponentially and uniformly over all possible similarity
functions, graphs and queried vertices of the graphs, while
the expected $\delta$-precision converges to 1 exponentially for any
fixed similarity function, graph and queried node.
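The difference between the two guarantees shows up when the bounds are evaluated. The sketch below assumes the recall bound $1 - e^{-\frac{6}{7}N\delta^2}$ and a precision bound of the form $1 - \frac{|\bar{S}_{-\delta}|}{|S_{+\delta}|} \cdot \frac{1}{e^{\frac{6}{7}N\delta^2} - 1}$; the function and parameter names are hypothetical.

```python
import math

def delta_recall_bound(n, delta):
    """Lower bound on the expected delta-recall with n fingerprints."""
    return 1.0 - math.exp(-6.0 / 7.0 * n * delta ** 2)

def delta_precision_bound(n, delta, size_far, size_near):
    """Lower bound on the expected delta-precision.

    size_far:  number of pages with sim(u, v) <= alpha - delta
    size_near: number of pages with sim(u, v) >  alpha + delta
    """
    ratio = size_far / size_near
    return 1.0 - ratio / math.expm1(6.0 / 7.0 * n * delta ** 2)
```

The recall bound depends only on $N$ and $\delta$, whereas the precision bound also depends on the ratio of dissimilar to similar pages. That ratio is why precision converges only for a fixed similarity function, graph and query, and why it can be vacuous when dissimilar pages vastly outnumber similar ones.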


5. EXPERIMENTS 


This section presents our experiments on the repository 
of 80M pages crawled by the Stanford WebBase project in 
2001. The following problems are addressed by our experi- 
ments: 


• How do the parameters $\ell$, $N$ and $c$ affect the quality of the
similarity search algorithms? The dependence on path
length $\ell$ shows that multi-step neighborhoods of pages contain
more valuable similarity information than single-step
neighborhoods for up to $\ell \approx 5$.


• How do the qualities of SimRank, PSimRank and XJaccard
relate to each other? We conclude that PSimRank
outperforms all the other methods.


• What are the average and maximal sizes of fingerprint
trees for SimRank and PSimRank? Recall that the running
time and memory requirement of query algorithms
are proportional to these sizes. We measured sizes as
small as 100-200 on average, implying fast running time
with low memory requirement.


5.1 Measuring the Quality of Similarity Scores 


We briefly recall the method of Haveliwala et al. [16] to 
measure the quality of similarity search algorithms. 

The similarity search algorithms will be compared to a 
ground truth similarity ordering extracted from the Open 
Directory Project (ODP, [26]) data, a hierarchical collection 
of webpages. The ODP category tree implicitly encodes 
the similarity information, which can be decoded as follows. 
The ODP tree is collapsed into a fixed depth, such that the 
leaves contain the classes of documents (urls). Given a page 
u the rest of the documents fall into the same class as u, 
a sibling class, a cousin class, etc. This induces a partial 
ordering of the documents, which will be referred to as the 
familial ordering with respect to u. The key assumption is 
that the true similarity to a page u decreases monotonically 
with the familial ordering. 

Intuitively we want to express the expected quality of a
similarity ordering to a query page $u$ in comparison with
the familial ordering of $u$, where $u$ is chosen uniformly at
random. The two orderings are compared by the Kruskal-Goodman
$\Gamma$ measure that gives score $+1$ to a pair $v, w$
if the two orderings agree on the similarity ordering of the
pair, and it gives $-1$ if they order the pair reversely. As both
orderings are partial, the $\Gamma$ value is defined as the average of
scores over all pairs that are comparable by both orderings.
To obtain a more precise measure focusing on the top region
of the familial ordering, the sibling $\Gamma$ measure [16] restricts the
averaging to vertices that either fall into the same or a sibling
class of $u$.
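The measure for a single query page can be sketched in a few lines of Python. The encoding is an assumption made for illustration: both orderings are given as numeric levels (larger means more similar to the query page), and a pair tied in either ordering is treated as not comparable.

```python
from itertools import combinations

def gamma(truth, score):
    """Kruskal-Goodman Gamma between two partial orderings.

    truth, score: dicts mapping each page to a number, larger meaning
    more similar to the query page. Pairs tied in either ordering are
    skipped as not comparable.
    """
    agree = disagree = 0
    for v, w in combinations(truth, 2):
        t = truth[v] - truth[w]
        s = score[v] - score[w]
        if t == 0 or s == 0:
            continue
        if (t > 0) == (s > 0):
            agree += 1
        else:
            disagree += 1
    if agree + disagree == 0:
        raise ValueError("no comparable pairs")
    return (agree - disagree) / (agree + disagree)
```

For the sibling variant, `truth` would contain only pages from the same or a sibling class of the query page (for example level 2 and level 1), and the reported value would be the average over randomly chosen query pages.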

We refer to [13] for subtle differences between our measurements
and the sibling $\Gamma$ defined in [16].


5.2 Comparing the Qualities of the Methods 
with Various Parameter Settings 


All the experiments were performed on a web graph of
78,636,371 pages crawled and parsed by the Stanford WebBase
project in 2001. In our copy of the ODP tree 218,720
URLs were found falling into 544 classes after collapsing the
tree. The indexing process took 4 hours for SimRank, 14
hours for PSimRank and 27 hours for extended Jaccard coefficient
with path length $\ell = 10$ and $N = 100$ fingerprints.
We ran a semi-external memory implementation on a single
machine with a 2.8 GHz Intel Pentium 4 processor, 2 Gbytes of
main memory and Linux OS. The total size of the computed
database was 68 Gbytes for (P)SimRank and 640 Gbytes for
XJaccard. Since sibling $\Gamma$ is based on similarity scores between
vertices of the ODP pages, we only saved the fingerprints
of the 218,720 ODP pages. A nice property of
our methods is that this truncation (resulting in sizes of
200 Mbytes and 1.8 Gbytes respectively) does not affect the
returned scores for the ODP pages.

The results of the experiments are depicted in Fig. 3. Recall
that sibling $\Gamma$ expresses the average quality of similarity
search algorithms with $\Gamma$ values falling into the range $[-1, 1]$.
The extreme $\Gamma = 1$ result would show that similarity scores
completely agree with the ground truth similarities, while
$\Gamma = -1$ would show the opposite. Our $\Gamma = 0.3$ to $0.4$ values
imply that our algorithms agree with the ODP familial
ordering in 65-70% of the pairs.
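The conversion from $\Gamma$ to the fraction of agreeing pairs is a one-line calculation. Writing $p$ for the fraction of comparable pairs on which the two orderings agree (a symbol introduced here for illustration), each agreeing pair scores $+1$ and each disagreeing pair scores $-1$, so:

```latex
\Gamma = p \cdot (+1) + (1 - p) \cdot (-1) = 2p - 1
\quad\Longrightarrow\quad
p = \frac{1 + \Gamma}{2},
\qquad
\Gamma = 0.3 \Rightarrow p = 0.65,
\qquad
\Gamma = 0.4 \Rightarrow p = 0.70 .
```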

The radically increasing $\Gamma$ values for path length $\ell =
1, 2, 3, 4$ on the top diagram support our basic assumption
that the multi-step neighborhoods of pages contain valuable
similarity information. The quality slightly increases
for larger values of $\ell$ in case of PSimRank and SimRank,
while sibling $\Gamma$ has its maximum value for $\ell = 4$ in case of
XJaccard. Notice the difference between the scale of the
top diagram and the scales of the other two diagrams.

The middle diagram shows the tendency that the quality
of similarity search can be increased by a smaller decay factor.
This phenomenon suggests that we should give higher
priority to the similarity information collected in smaller
distances and rely on long-distance similarities only if necessary.
The bottom diagram depicts the changes of $\Gamma$ as
a function of the number $N$ of fingerprints. The diagram
shows a slight quality increase as the estimated similarity scores
become more precise with larger values of $N$.

Finally, we conclude from all the three diagrams that 
PSimRank scores introduced in Section 2.2 outperform all 
the other similarity search algorithms. 


5.3 Time and memory requirement 
of fingerprint tree queries 


Recall from Section 2.1.2 that for SimRank and PSim- 
Rank queries N fingerprint trees are loaded and traversed. 
N can be easily increased with Monte Carlo parallelization, 
but the sizes of fingerprint trees may be as large as the 
number V of vertices. This would require both memory and 
running time in the order of V, and thus violate the re- 
quirements of Section 1.2. The experiments verify that this 
problem does not occur in case of real web data. 

Fig. 4 shows the growing sizes of fingerprint trees as a
function of path length $\ell$ in databases containing fingerprints
for all vertices of the Stanford WebBase graph. Recall that
the trees are growing when random walks meet and the corresponding
trees join into one tree. It is not surprising that
the tree sizes of PSimRank exceed those of SimRank, since
the correlated random walks meet each other with higher
probabilities than the independent walks of SimRank.

We conclude from the lower curves of Fig. 4 that the average
tree size read for a query vertex is approximately 100-200,
thus the algorithm performs like an external-memory
algorithm on average in case of our web graph. Even the
largest fingerprint trees have no more than 10-20K vertices,
which is still very small compared to the 80M pages.

Figure 3: Varying algorithm parameters independently
with default settings $\ell = 10$ for SimRank and
PSimRank, $\ell = 4$ for XJaccard, $c = 0.1$, and $N = 100$.

Figure 4: Fingerprint tree sizes for 80M pages with
$N = 100$ samples.


6. CONCLUSION 


We introduced the framework of link-based Monte Carlo 
similarity search to achieve scalable algorithms for similar- 
ity functions evaluated from the multi-step neighborhoods 
of web pages. Within this framework, we presented the first 
algorithm to approximate SimRank scores with a near linear 
external memory method and parallelization techniques suf- 
ficient for large scale computation. In addition, we defined 
new similarity functions PSimRank and the extended Jac- 
card coefficient with scalable algorithms. Our experiments 
conducted on the Stanford WebBase graph of 80M pages 
demonstrate scalability and suggest that PSimRank outper- 
forms SimRank and extended Jaccard coefficient in terms of 
quality. 